S+: Efficient 2D Sparse LU Factorization on Parallel Machines
Authors
Department of Computer Science, University of California at Santa Barbara, CA, USA (kshen@cs.ucsb.edu)
Department of Computer Science, University of California at Santa Barbara, CA, USA (tyang@cs.ucsb.edu)
Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA
This work was supported in part by NSF CCR-[…] and by DARPA through UMD (ONR Contract Number N[…]C[…]).
Abstract
Static symbolic factorization coupled with supernode partitioning and asynchronous computation scheduling can achieve high gigaflop rates for parallel sparse LU factorization with partial pivoting. This paper studies properties of elimination forests and uses them to optimize supernode partitioning, amalgamation, and execution scheduling. It also proposes supernodal matrix multiplication to speed up kernel computation by retaining BLAS-3 level efficiency and avoiding unnecessary arithmetic operations. The experiments show that our new design with proper space optimization, called S+, improves our previous solution substantially and can achieve up to […] GFLOPS on […] Cray T3E […] MHz nodes.
Introduction
The solution of sparse linear systems is a computational bottleneck in many scientific computing problems. When dynamic pivoting is required to maintain numerical stability in direct methods for solving nonsymmetric linear systems, it is challenging to develop high-performance parallel code, because pivoting causes severe cache misses and load imbalance on modern architectures with memory hierarchies. Previous work has addressed parallelization on shared-memory platforms or with restricted pivoting. Most notably, the recent shared-memory implementation of SuperLU has achieved up to […] GFLOPS on […] Cray C[…] nodes.
For distributed-memory machines, we proposed an approach that adopts a static symbolic factorization scheme to avoid data structure variation. Static symbolic factorization eliminates the run-time overhead of dynamic symbolic factorization at the price of overestimated fill-ins and, consequently, extra computation. However, the static data structure allowed us to identify data regularity, maximize the use of BLAS operations, and utilize task graph scheduling techniques and efficient run-time support to achieve high efficiency.
This paper addresses three issues to further improve the performance of parallel sparse LU factorization with partial pivoting on distributed-memory machines. First, we study the use of elimination trees in optimizing matrix partitioning and task scheduling. Elimination trees or forests are used extensively in sparse Cholesky factorization because they represent parallelism more compactly than task graphs. For sparse LU factorization, the traditional approach uses the elimination tree of $A^T A$, which can introduce excessive false computational dependencies. In this paper, we use elimination trees (forests) of $A$ to guide matrix partitioning and parallelism control in LU factorization. We show that improved supernode partitioning and amalgamation effectively control extra fill-ins and produce an optimized supernodal partitioning. We also use elimination forests to identify data dependence and potential concurrency among pivoting and updating tasks, and thus maximize the utilization of limited parallelism.
Second, we propose a fast and space-efficient kernel for supernode-based matrix multiplication […]
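The static symbolic factorization discussed above computes, before any numerical work, a structure that is large enough for every possible pivot choice. A minimal sketch of the classic George–Ng style bound follows, assuming the matrix pattern is given as one set of column indices per row; the function name `static_symbolic_lu` is hypothetical and this is not the S+ code itself. At each column, every remaining row with a possible nonzero there is a pivot candidate, and all candidates receive the union of their trailing structures, which is exactly where the overestimated fill-ins come from.

```python
from typing import List, Set, Tuple

# Illustrative sketch; the function name is hypothetical, not taken from S+.


def static_symbolic_lu(n: int, pattern: List[Set[int]]) -> Tuple[List[List[int]], List[List[int]]]:
    """Bound the nonzero structures of L and U for LU with partial pivoting.

    pattern[i] holds the column indices of the nonzeros in row i of the n-by-n
    matrix A. Returns (l_rows, u_cols): l_rows[k] lists the original rows that
    are pivot candidates for column k (the bound on L's column k, pivot
    included), and u_cols[k] lists the columns that may be nonzero in row k of U.
    """
    struct = [set(row) for row in pattern]  # working copy, grown with fill-in
    remaining = set(range(n))               # rows not yet used as pivot rows
    l_rows: List[List[int]] = []
    u_cols: List[List[int]] = []
    for k in range(n):
        # Every remaining row with a (possible) nonzero in column k may become the pivot.
        cand = {i for i in remaining if k in struct[i]}
        if not cand:
            raise ValueError(f"structurally singular at column {k}")
        # The chosen pivot row fills into every other candidate row; since the
        # choice is unknown in advance, give all candidates the union of their
        # structures from column k on.
        merged: Set[int] = set()
        for i in cand:
            merged |= {j for j in struct[i] if j >= k}
        for i in cand:
            struct[i] |= merged
        # All candidates now agree from column k on, so which one is taken as
        # the symbolic pivot does not change the bound; use the smallest index.
        remaining.remove(min(cand))
        l_rows.append(sorted(cand))
        u_cols.append(sorted(merged))
    return l_rows, u_cols


if __name__ == "__main__":
    # Hypothetical 3x3 pattern: A = [[x, 0, x], [x, x, 0], [0, x, x]]
    l_bound, u_bound = static_symbolic_lu(3, [{0, 2}, {0, 1}, {1, 2}])
    print(l_bound)  # [[0, 1], [1, 2], [2]]
    print(u_bound)  # [[0, 1, 2], [1, 2], [2]]
```

In the example, row 0 of U is bounded by all three columns because either row 0 or row 1 may be chosen as the first pivot, even though no single pivot choice fills both positions.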
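For reference, the traditional structure mentioned above, the elimination tree of $A^T A$ (often called the column elimination tree), can be computed with Liu's path-compression algorithm. The sketch below forms the pattern of $A^T A$ explicitly for clarity, which production codes avoid; `column_etree` is a hypothetical name. It illustrates the baseline the text argues against, not the elimination forest of $A$ that S+ uses.

```python
from typing import List, Set

# Illustrative sketch; the function name is hypothetical, not taken from S+.


def column_etree(n_cols: int, row_patterns: List[Set[int]]) -> List[int]:
    """Elimination tree of A^T A, computed from the row structures of A.

    row_patterns[r] holds the column indices of the nonzeros in row r of A.
    Returns parent[j], or -1 for a root; several roots mean a forest.
    """
    # Columns i < j are adjacent in A^T A when some row of A touches both.
    lower: List[Set[int]] = [set() for _ in range(n_cols)]
    for cols in row_patterns:
        for j in cols:
            lower[j] |= {i for i in cols if i < j}
    parent = [-1] * n_cols
    ancestor = [-1] * n_cols  # path-compressed shortcut towards the current root
    for k in range(n_cols):
        for i in sorted(lower[k]):
            # Climb from i to the root of its current subtree, pointing every
            # visited node straight at k; the old root becomes a child of k.
            while i != -1 and i < k:
                nxt = ancestor[i]
                ancestor[i] = k
                if nxt == -1:
                    parent[i] = k
                i = nxt
    return parent
```

For the pattern `[{0, 2}, {0, 1}, {1, 2}]`, every pair of columns shares a row, so the result is the chain `[1, 2, -1]`; a block-diagonal pattern such as `[{0, 1}, {1}, {2, 3}, {3}]` gives the two-tree forest `[1, -1, 3, -1]`.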
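Supernode partitioning, as discussed above, groups consecutive columns of L so that their updates can run as dense block operations. A minimal sketch of one common definition follows: consecutive columns whose diagonal block is dense and whose rows below that block are identical. It assumes the below-diagonal structure of L is already known, for example from a static symbolic factorization; `find_supernodes` is a hypothetical name, and neither the paper's forest-guided partitioning nor its amalgamation step (merging nearby supernodes at the cost of some extra zeros) is shown.

```python
from typing import List, Set, Tuple

# Illustrative sketch; the function name is hypothetical, not taken from S+.


def find_supernodes(l_struct: List[Set[int]]) -> List[Tuple[int, int]]:
    """Group consecutive columns of L into supernodes.

    l_struct[j] holds the rows i > j where L[i][j] may be nonzero.
    Returns inclusive (first, last) column ranges.
    """
    n = len(l_struct)
    supernodes: List[Tuple[int, int]] = []
    start = 0
    for j in range(1, n):
        # Column j joins the current supernode when column j-1 has exactly
        # column j's below-diagonal rows plus row j itself; chaining this
        # condition yields a dense diagonal block with identical rows below.
        if l_struct[j - 1] == l_struct[j] | {j}:
            continue
        supernodes.append((start, j - 1))
        start = j
    if n:
        supernodes.append((start, n - 1))
    return supernodes
```

For instance, `find_supernodes([{1, 3}, {3}, set(), set()])` returns `[(0, 1), (2, 2), (3, 3)]`: columns 0 and 1 share row 3 below a dense 2-by-2 diagonal block, while column 2 has no entry in row 3 and so cannot absorb column 3.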
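The text also uses elimination forests to expose concurrency among tasks. The simplest way to read independence off a forest is to group nodes by height: an ancestor always has a strictly greater height than its descendants, so nodes of equal height are never in an ancestor–descendant relation. The sketch below assumes that a node's task depends only on its descendants and that `parent[j] > j`, as in elimination trees; `concurrency_levels` is a hypothetical name. S+ itself schedules asynchronously, so this level-by-level grouping is only an illustration of the dependence structure.

```python
from typing import Dict, List

# Illustrative sketch; the function name is hypothetical, not taken from S+.


def concurrency_levels(parent: List[int]) -> List[List[int]]:
    """Group the nodes of an elimination forest into mutually independent levels.

    parent[j] is j's parent (always greater than j) or -1 for a root.
    Nodes in the same level are never ancestor and descendant of each other,
    so, under the stated dependence assumption, they can run concurrently.
    """
    height = [0] * len(parent)
    # Children have smaller indices than their parents, so one increasing sweep
    # fixes each node's height before it is propagated to its parent.
    for j, p in enumerate(parent):
        if p != -1:
            height[p] = max(height[p], height[j] + 1)
    levels: Dict[int, List[int]] = {}
    for j, h in enumerate(height):
        levels.setdefault(h, []).append(j)
    return [levels[h] for h in sorted(levels)]
```

For the two-tree forest `[1, -1, 3, -1]`, the result is `[[0, 2], [1, 3]]`: both leaves can be processed together, followed by both roots.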
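The supernodal matrix multiplication mentioned in the abstract is only described at a high level here. A minimal sketch of the general idea follows, assuming the static structure tells us which columns of a U block are structurally zero: gather the remaining columns into one compact dense block, perform a single dense multiply on it (the place where a BLAS-3 GEMM call would go), and scatter the result back into the target block. The names `dense_matmul` and `supernodal_update` are hypothetical, and the actual S+ kernel may organize the computation differently.

```python
from typing import List

# Illustrative sketch; function names are hypothetical, not taken from S+.
Matrix = List[List[float]]


def dense_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense product; a real code would call a BLAS-3 GEMM here."""
    inner = len(b)
    cols = len(b[0]) if b else 0
    return [[sum(a[r][t] * b[t][c] for t in range(inner)) for c in range(cols)]
            for r in range(len(a))]


def supernodal_update(c: Matrix, l_blk: Matrix, u_blk: Matrix,
                      nonzero_cols: List[int]) -> Matrix:
    """Compute C -= L_blk * U_blk, skipping columns of U_blk known to be zero."""
    # Gather the possibly nonzero columns of U_blk into one compact dense block.
    u_compact = [[row[j] for j in nonzero_cols] for row in u_blk]
    prod = dense_matmul(l_blk, u_compact)
    # Scatter the compact product back into the matching columns of C.
    for r, prod_row in enumerate(prod):
        for t, j in enumerate(nonzero_cols):
            c[r][j] -= prod_row[t]
    return c
```

With `l_blk = [[1, 2], [3, 4]]`, `u_blk = [[5, 0, 6], [7, 0, 8]]`, `nonzero_cols = [0, 2]`, and a zero 2-by-3 `c`, column 1 is never touched and `c` becomes `[[-19, 0, -22], [-43, 0, -50]]`. Doing one multiply on the compacted block, rather than one per column, is what keeps the kernel in dense, cache-friendly form.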
Similar sources
Efficient Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures
A sparse LU factorization based on Gaussian elimination with partial pivoting (GEPP) is important to many scientific applications, but it is still an open problem to develop a high performance GEPP code on distributed memory machines. The main difficulty is that partial pivoting operations dynamically change computation and nonzero fill-in structures during the elimination process. This paper p...
Parallel Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures
Gaussian elimination based sparse LU factorization with partial pivoting is important to many scientific applications, but it is still an open problem to develop a high performance sparse LU code on distributed memory machines. The main difficulty is that partial pivoting operations make structures of L and U factors unpredictable beforehand. This paper presents an approach called S for paralleliz...
Efficient Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures
A sparse LU factorization based on Gaussian elimination with partial pivoting (GEPP) is important to many scientific applications, but it is still an open problem to develop a high performance GEPP code on distributed memory machines. The main difficulty is that partial pivoting operations dynamically change computation and nonzero fill-in structures during the elimination process. This paper presen...
Parallel Direct Solution of Linear Equations on FPGA-Based Machines
The efficient solution of large systems of linear equations represented by sparse matrices appears in many tasks. LU factorization followed by backward and forward substitutions is widely used for this purpose. Parallel implementations of this computation-intensive process are limited primarily to supercomputers. New generations of Field-Programmable Gate Array (FPGA) technologies enable the im...
A PERFORMANCE STUDY OF SPARSE CHOLESKY FACTORIZATION ON INTEL iPSC/860
The problem of Cholesky factorization of a sparse matrix has been very well investigated on sequential machines. A number of efficient codes exist for factorizing large unstructured sparse matrices, for example, codes from Harwell Subroutine Library [4] and Sparspak [7]. However, there is a lack of such efficient codes on parallel machines in general, and distributed memory machines in particul...
Journal title:
- SIAM J. Matrix Analysis Applications
Volume 22, Issue
Pages -
Publication date: 2000